Depth Assisted Full Resolution Network for Single Image-based View Synthesis
Research on novel viewpoint synthesis mainly focuses on interpolation from
multi-view input images. In this paper, we focus on a more challenging and
ill-posed problem: synthesizing novel viewpoints from a single input
image. To achieve this goal, we propose a novel deep learning-based technique.
We design a full resolution network that extracts local image features at the
same resolution as the input, which helps to preserve high resolution and
prevent blurry artifacts in the final synthesized images. We also incorporate a
pre-trained depth estimation network into our system, so that 3D information
can be used to infer the flow field between the input and the target
image. Since the depth network is trained on depth order information between
arbitrary pairs of points in the scene, global image features are also brought
into our system. Finally, a synthesis layer is used to not only warp the
observed pixels to the desired positions but also hallucinate the missing
pixels with recorded pixels. Experiments show that our technique performs well
on images of various scenes, and outperforms the state-of-the-art techniques.
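As a rough illustration of the warp-then-fill idea described in this abstract, the sketch below forward-warps a single image row using per-pixel depth. It assumes a purely horizontal camera shift and a pinhole model in which disparity = focal * baseline / depth. The function name `warp_scanline`, its parameters, and the nearest-neighbour hole filling are hypothetical stand-ins, not the paper's learned flow prediction or synthesis layer.

```python
def warp_scanline(pixels, depths, focal, baseline):
    """Forward-warp one image row to a camera shifted horizontally by `baseline`.

    Assumption: disparity = focal * baseline / depth (pinhole camera, pure
    horizontal translation). Returns (warped row with holes, hole-filled row).
    """
    width = len(pixels)
    target = [None] * width            # None marks a pixel nothing landed on
    zbuf = [float("inf")] * width      # depth of the pixel currently stored

    for x, (value, depth) in enumerate(zip(pixels, depths)):
        disparity = focal * baseline / depth
        tx = x - int(round(disparity))
        # Keep the closest surface when several pixels land on the same spot.
        if 0 <= tx < width and depth < zbuf[tx]:
            target[tx] = value
            zbuf[tx] = depth

    # Fill each hole from the nearest observed pixel (left first, then right).
    filled = list(target)
    for x in range(width):
        if filled[x] is None:
            left = next((target[i] for i in range(x - 1, -1, -1)
                         if target[i] is not None), None)
            right = next((target[i] for i in range(x + 1, width)
                          if target[i] is not None), None)
            filled[x] = left if left is not None else right
    return target, filled


target, filled = warp_scanline([10, 20, 30, 40, 50], [4, 4, 2, 4, 4],
                               focal=4, baseline=1)
print(target)  # [30, None, 40, 50, None]
print(filled)  # [30, 30, 40, 50, 50]
```

In this toy row the near pixel (depth 2) moves further than its neighbours and occludes one of them, which leaves holes that are then filled from observed pixels.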
Quasiparticle interference of C2-symmetric surface states in LaOFeAs parent compound
We present scanning tunneling microscopy studies of the LaOFeAs parent
compound of iron pnictide superconductors. Topographic imaging reveals two
types of atomically flat surfaces, corresponding to the exposed LaO layer and
FeAs layer respectively. On one type of surface, we observe strong standing
wave patterns induced by quasiparticle interference of two-dimensional surface
states. The distribution of scattering wavevectors exhibits pronounced two-fold
symmetry, consistent with the nematic electronic structure found in the
Ca(Fe1-xCox)2As2 parent state.
Comment: 13 pages, 4 figures
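One simple way to quantify two-fold versus four-fold symmetry in a standing-wave pattern is to compare the Fourier amplitude at the same wavevector along two orthogonal directions. The sketch below does this on synthetic data only; the helper names, grid size, and anisotropy measure are assumptions made for illustration and are not taken from the paper's analysis.

```python
import cmath
import math


def fourier_amplitude(image, kx, ky):
    """Magnitude of the discrete Fourier coefficient at integer wavevector (kx, ky)."""
    n = len(image)
    total = 0j
    for y in range(n):
        for x in range(n):
            phase = -2j * math.pi * (kx * x + ky * y) / n
            total += image[y][x] * cmath.exp(phase)
    return abs(total) / (n * n)


def anisotropy(image, k):
    """(A_x - A_y) / (A_x + A_y): near 0 for a four-fold pattern, near 1 for stripes along y."""
    ax = fourier_amplitude(image, k, 0)
    ay = fourier_amplitude(image, 0, k)
    return (ax - ay) / (ax + ay)


n, k = 16, 3
# Synthetic two-fold (C2) pattern: standing waves along x only.
stripes = [[math.cos(2 * math.pi * k * x / n) for x in range(n)] for y in range(n)]
# Synthetic four-fold (C4) pattern: equal standing waves along x and y.
checker = [[math.cos(2 * math.pi * k * x / n) + math.cos(2 * math.pi * k * y / n)
            for x in range(n)] for y in range(n)]

print(round(anisotropy(stripes, k), 6))  # close to 1
print(round(anisotropy(checker, k), 6))  # close to 0
```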
Towards Ghost-free Shadow Removal via Dual Hierarchical Aggregation Network and Shadow Matting GAN
Shadow removal is an essential task for scene understanding. Many studies
consider only matching the image contents, which often causes two types of
ghosts: color inconsistencies in shadow regions or artifacts on shadow
boundaries. In this paper, we tackle these issues in two ways. First, to
learn a boundary-artifact-free image, we propose a novel network
structure named the dual hierarchical aggregation network (DHAN). It contains
a series of growth dilated convolutions as the backbone without any
down-samplings, and we hierarchically aggregate multi-context features for
attention and prediction, respectively. Second, we argue that training on a
limited dataset restricts the textural understanding of the network, which
leads to color inconsistencies in the shadow region. Currently, the largest
dataset contains 2k+ shadow/shadow-free image pairs. However, it has only 0.1k+
unique scenes since many samples share exactly the same background with
different shadow positions. Thus, we design a shadow matting generative
adversarial network (SMGAN) to synthesize realistic shadow mattings from a
given shadow mask and shadow-free image. With the help of novel masks or
scenes, we enhance the current datasets using synthesized shadow images.
Experiments show that our DHAN can erase the shadows and produce high-quality
ghost-free images. After training on the synthesized and real datasets, our
network outperforms other state-of-the-art methods by a large margin. The code
is available at: http://github.com/vinthony/ghost-free-shadow-removal/
Comment: Accepted by AAAI 202
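To see why a backbone of dilated convolutions can gather wide context without any down-sampling, it helps to compute the receptive field of a stack of stride-1 layers: each layer adds (kernel_size - 1) * dilation pixels. The sketch below is only this arithmetic; the dilation rates shown are hypothetical and are not taken from the DHAN paper.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in pixels, along one axis) of stacked stride-1 dilated convolutions."""
    field = 1
    for dilation in dilations:
        # A k-tap kernel with dilation d spans (k - 1) * d + 1 input pixels.
        field += (kernel_size - 1) * dilation
    return field


# Hypothetical dilation schedules for five 3x3 layers.
print(receptive_field(3, [1, 1, 1, 1, 1]))   # 11
print(receptive_field(3, [1, 2, 4, 8, 16]))  # 63
```

With the same number of layers and no loss of spatial resolution, the growing schedule covers a window almost six times wider per axis than the undilated stack.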
High-Resolution Document Shadow Removal via A Large-Scale Real-World Dataset and A Frequency-Aware Shadow Erasing Net
Shadows often occur when we capture documents with casual equipment,
which degrades the visual quality and readability of the digital copies.
Unlike algorithms for natural shadow removal, algorithms for
document shadow removal need to preserve the details of fonts and figures in
high-resolution input. Previous works ignore this problem and remove the
shadows via approximate attention and small datasets, which might not work in
real-world situations. We handle high-resolution document shadow removal
directly via a larger-scale real-world dataset and a carefully designed
frequency-aware network. As for the dataset, we acquire over 7k pairs of
high-resolution (2462 x 3699) real-world document images covering various
samples under different lighting conditions, which is 10 times larger than
existing datasets. As for the design of the network, we decouple the
high-resolution images in the frequency domain, where the low-frequency details
and high-frequency boundaries can be effectively learned via the carefully
designed network structure. Powered by our network and dataset, the proposed
method clearly shows a better performance than previous methods in terms of
visual quality and numerical results. The code, models, and dataset are
available at: https://github.com/CXH-Research/DocShadow-SD7K
Comment: Accepted by International Conference on Computer Vision 2023 (ICCV 2023)
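The abstract does not specify how the image is decoupled in the frequency domain. As a minimal sketch of the general idea, the code below splits a 1D row of pixel intensities into a smooth low-frequency part and a residual high-frequency part using a box blur, a stand-in chosen purely for illustration. The function names and the `radius` parameter are hypothetical.

```python
def box_blur(signal, radius):
    """Low-pass filter: average each sample with its neighbours within `radius`."""
    n = len(signal)
    blurred = []
    for i in range(n):
        window = signal[max(0, i - radius):min(n, i + radius + 1)]
        blurred.append(sum(window) / len(window))
    return blurred


def split_frequencies(signal, radius=1):
    """Return (low, high) such that low[i] + high[i] reconstructs signal[i]."""
    low = box_blur(signal, radius)
    high = [s - l for s, l in zip(signal, low)]
    return low, high


# A bright page region next to dark text: one sharp edge in the middle.
row = [10.0, 10.0, 10.0, 2.0, 2.0, 2.0]
low, high = split_frequencies(row)
print(high[0], high[-1])  # 0.0 0.0
```

The high-frequency part is zero in flat regions and non-zero only around the edge, which is where stroke boundaries live. The low-frequency part carries the smooth illumination that a shadow mostly affects.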
Explicit Visual Prompting for Universal Foreground Segmentations
Foreground segmentation is a fundamental problem in computer vision, which
includes salient object detection, forgery detection, defocus blur detection,
shadow detection, and camouflage object detection. Previous works have
typically relied on domain-specific solutions to address accuracy and
robustness issues in those applications. In this paper, we present a unified
framework for a number of foreground segmentation tasks without any
task-specific designs. We take inspiration from the widely-used pre-training
and then prompt tuning protocols in NLP and propose a new visual prompting
model, named Explicit Visual Prompting (EVP). Unlike previous
visual prompting, which is typically a dataset-level implicit embedding, our key
insight is to make the tunable parameters focus on the explicit visual
content from each individual image, i.e., the features from frozen patch
embeddings and high-frequency components. Our method freezes a pre-trained
model and then learns task-specific knowledge using a few extra parameters.
Despite introducing only a small number of tunable parameters, EVP achieves
better performance than full fine-tuning and other parameter-efficient
fine-tuning methods. Experiments on fourteen datasets across five tasks show
that the proposed method outperforms other task-specific methods while being
considerably simple. The proposed method also scales across
different architectures, pre-trained weights, and tasks. The code is available
at: https://github.com/NiFangBaAGe/Explicit-Visual-Prompt.
Comment: arXiv admin note: substantial text overlap with arXiv:2303.1088
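The "freeze the model, tune a few extra parameters" recipe can be shown with a deliberately tiny toy. In the sketch below the "pre-trained model" is a single fixed weight and the only trainable quantity is a scalar prompt added to the input. This is a hypothetical one-parameter illustration of prompt tuning in general, not the EVP architecture. The name `train_prompt` and all numbers are assumptions.

```python
def train_prompt(frozen_weight, inputs, targets, steps=200, lr=0.1):
    """Fit a scalar input prompt by gradient descent while the model weight stays frozen."""
    prompt = 0.0  # the only tunable parameter
    for _ in range(steps):
        grad = 0.0
        for x, t in zip(inputs, targets):
            pred = frozen_weight * (x + prompt)      # frozen model applied to prompted input
            grad += 2.0 * (pred - t) * frozen_weight  # d(squared error)/d(prompt)
        prompt -= lr * grad / len(inputs)
    return prompt


frozen_weight = 2.0
inputs = [0.0, 1.0, 2.0, 3.0]
# Toy downstream task: the frozen model would be exact if inputs were shifted by 0.5.
targets = [frozen_weight * (x + 0.5) for x in inputs]

prompt = train_prompt(frozen_weight, inputs, targets)
print(round(prompt, 4))  # 0.5
```

Only `prompt` is updated, and `frozen_weight` is never written to. That is the property that keeps the number of tunable parameters small.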
LivelySpeaker: Towards Semantic-Aware Co-Speech Gesture Generation
Gestures are non-verbal but important behaviors accompanying people's speech.
While previous methods are able to generate speech rhythm-synchronized
gestures, the semantic context of the speech is generally lacking in the
gesticulations. Although semantic gestures do not occur very regularly in human
speech, they are indeed the key for the audience to understand the speech
context in a more immersive environment. Hence, we introduce LivelySpeaker, a
framework that realizes semantics-aware co-speech gesture generation and offers
several control handles. In particular, our method decouples the task into two
stages: script-based gesture generation and audio-guided rhythm refinement.
Specifically, the script-based gesture generation leverages the pre-trained
CLIP text embeddings as the guidance for generating gestures that are highly
semantically aligned with the script. Then, we devise a simple but effective
diffusion-based gesture generation backbone built purely from MLPs, which is
conditioned only on audio signals and learns to gesticulate with realistic
motions. We use this powerful prior to align the rhythm of the script-guided
gestures with the audio signals, notably in a zero-shot setting. Our novel two-stage
generation framework also enables several applications, such as changing the
gesticulation style, editing the co-speech gestures via textual prompting, and
controlling the semantic awareness and rhythm alignment with guided diffusion.
Extensive experiments demonstrate the advantages of the proposed framework over
competing methods. In addition, our core diffusion-based generative model also
achieves state-of-the-art performance on two benchmarks. The code and model
will be released to facilitate future research.
Comment: Accepted by ICCV 202
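The abstract does not say which guidance formulation is used. Assuming the widely used classifier-free guidance form, a guidance scale trades off the unconditional and conditional noise predictions at every denoising step, as in the sketch below. The function name `guided_noise` and the example values are hypothetical.

```python
def guided_noise(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: move from the unconditional prediction toward
    (and, for scale > 1, beyond) the conditional one."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.0, 1.0]   # prediction without the condition
eps_cond = [1.0, 3.0]     # prediction with the condition (e.g. audio)

print(guided_noise(eps_uncond, eps_cond, 0.0))  # [0.0, 1.0]
print(guided_noise(eps_uncond, eps_cond, 1.0))  # [1.0, 3.0]
print(guided_noise(eps_uncond, eps_cond, 2.0))  # [2.0, 5.0]
```

A scale of 0 ignores the condition, 1 reproduces the conditional prediction, and larger values push the sample to follow the condition more strongly.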
Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos
Generating text-editable and pose-controllable character videos is in
high demand for creating various digital humans. Nevertheless, this task has
been restricted by the absence of a comprehensive dataset featuring paired
video-pose captions and the generative prior models for videos. In this work,
we design a novel two-stage training scheme that can utilize easily obtained
datasets (i.e., image-pose pairs and pose-free videos) and the pre-trained
text-to-image (T2I) model to obtain the pose-controllable character videos.
Specifically, in the first stage, the keypoint-image pairs are used only
for controllable text-to-image generation. We learn a zero-initialized
convolutional encoder to encode the pose information. In the second stage, we
finetune the motion of the above network via a pose-free video dataset by
adding the learnable temporal self-attention and reformed cross-frame
self-attention blocks. Powered by our new designs, our method successfully
generates continuously pose-controllable character videos while keeping the
editing and concept composition ability of the pre-trained T2I model. The code
and models will be made publicly available.
Comment: Project page: https://follow-your-pose.github.io/; GitHub repository: https://github.com/mayuelala/FollowYourPos
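A zero-initialized encoder is useful because, at the start of training, it contributes nothing and therefore leaves the pre-trained model's behaviour intact. The sketch below shows this with a 1D convolution whose output is added to a latent vector. Adding the pose features residually is an assumption made for this illustration, and the names `conv1d_same` and `inject_pose` are hypothetical.

```python
def conv1d_same(signal, kernel):
    """1D cross-correlation with zero padding, output length equal to input length."""
    radius = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = i + j - radius
            if 0 <= idx < len(signal):
                acc += weight * signal[idx]
        out.append(acc)
    return out


def inject_pose(latent, pose, kernel):
    """Add encoded pose features to the latent (residual injection, assumed here)."""
    pose_features = conv1d_same(pose, kernel)
    return [l + p for l, p in zip(latent, pose_features)]


latent = [0.5, -1.0, 2.0]
pose = [1.0, 1.0, 0.0]

# Zero-initialized kernel: the latent passes through unchanged.
print(inject_pose(latent, pose, [0.0, 0.0, 0.0]))  # [0.5, -1.0, 2.0]
# After training moves the kernel away from zero, pose starts to influence the latent.
print(inject_pose(latent, pose, [0.0, 1.0, 0.0]))  # [1.5, 0.0, 2.0]
```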